Understanding our Data

  • Last week we introduced some of the key motivations behind Environmental Statistics.

  • The course will cover a number of statistical ideas around the general theme of environmental data.

  • This week we will be looking at uncertainty and variability, and how we can measure these and incorporate them into our conclusions.

  • We will then look at a number of important features of environmental data — censoring, outliers and missing data.

Uncertainty and Variability

Uncertainty and Error

  • We often talk about uncertainty and error as though they are interchangeable, but this is not quite correct.

  • Error is the difference between the measured value and the “true value” of the thing being measured.

  • Uncertainty is a quantification of the variability of the measurement result.

  • Practically speaking, we make use of common statistical distributions to account for uncertainty.

Recap: Continuous Distributions

A continuous random variable \(X\) follows a normal distribution with mean \(\mu\) and standard deviation \(\sigma\) if its probability density function (pdf) is:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]

We denote this as:

\[ X \sim \mathcal{N}(\mu, \sigma^2), ~\text{where} ~ -\infty < X < +\infty \]
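
As a quick check on the density above, the formula can be written out directly and compared with Python's built-in `statistics.NormalDist`. This is a minimal sketch; the values of `mu` and `sigma` are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, written out from the formula above."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Hypothetical parameters (not taken from any data set)
mu, sigma = 10.0, 2.0
dist = NormalDist(mu, sigma)

# The hand-written density and the library density should agree
density_at_mean = normal_pdf(mu, mu, sigma)
library_value = dist.pdf(mu)

# Probability of falling within one standard deviation of the mean
within_one_sd = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)

print(density_at_mean, library_value, within_one_sd)
```

The last quantity is the familiar "about 68%" rule, which holds for any choice of `mu` and `sigma`.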

Why can’t we just use normal distributions for all environmental data?

A random variable \(X\) follows a log-normal distribution if \(\ln(X)\) follows a normal distribution, i.e.

\[ Y = \ln(X) \sim \mathcal{N}(\mu, \sigma^2), \quad \text{where}~ X \in (0, +\infty) \]
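
One way to see why a normal distribution does not suit every environmental variable is to simulate a log-normal sample: the values are strictly positive and right-skewed, while their logarithms behave like a normal sample. The sketch below uses simulated data with hypothetical parameters `mu` and `sigma`, not real measurements.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical parameters of the underlying normal distribution of ln(X)
mu, sigma = 1.0, 0.8

# Simulated log-normal sample and its log-transform
x = [random.lognormvariate(mu, sigma) for _ in range(20000)]
y = [math.log(v) for v in x]

# On the log scale we recover the normal parameters
mean_log = statistics.fmean(y)
sd_log = statistics.stdev(y)

# On the original scale the data are positive and right-skewed,
# so the mean sits above the median
all_positive = min(x) > 0
mean_x = statistics.fmean(x)
median_x = statistics.median(x)

print(mean_log, sd_log, all_positive, mean_x, median_x)
```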

A random variable \(X\) follows an exponential distribution with rate parameter \(\lambda >0\) if its probability density function (pdf) is:

\[ f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \geq 0 \\ 0 & \text{for } x < 0 \end{cases} \]

\(\lambda\) describes the rate of events, i.e. the number of events per unit time or distance.

  • Higher \(\lambda\) = more frequent events
  • Mean waiting time: \(E[X] = \frac{1}{\lambda}\) (e.g., \(\lambda = 0.2\) rainfall events/hour \(\rightarrow\) Mean time between events = 5 hours)
  • Variance: \(Var(X) = \frac{1}{\lambda^2}\)
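
The rainfall example can be checked by simulation. This sketch draws waiting times with `random.expovariate` at the rate \(\lambda = 0.2\) from the example and compares the sample mean and variance with \(1/\lambda\) and \(1/\lambda^2\); the sample size and seed are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

lam = 0.2  # rainfall events per hour, as in the example above

# Simulated waiting times (hours) between successive events
waits = [random.expovariate(lam) for _ in range(50000)]

sample_mean = statistics.fmean(waits)    # should be close to 1 / lam
sample_var = statistics.variance(waits)  # should be close to 1 / lam**2

print(sample_mean, 1 / lam)
print(sample_var, 1 / lam**2)
```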

Recap: Discrete Distributions

A discrete random variable \(X\) follows a Poisson distribution with rate parameter \(\lambda > 0\) if its probability mass function (pmf) is:

\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, ~ k = 0, 1, \dots \]

We denote this as \(X \sim Po(\lambda)\) where \(\lambda\) describes:

  • Expected number of events in a fixed interval
  • Mean events per unit time/area/volume
  • Example: \(\lambda = 3.2\) means 3.2 events expected on average
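
The Poisson mass function can be evaluated directly from the formula. The sketch below uses \(\lambda = 3.2\) from the example and checks numerically that the probabilities sum to one and that the mean equals \(\lambda\); the cut-off of 60 terms is an arbitrary truncation at which the remaining tail is negligible.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam), written out from the formula above."""
    return lam**k * math.exp(-lam) / math.factorial(k)


lam = 3.2  # expected number of events, as in the example above

# Probabilities for k = 0, 1, ..., 59 (the tail beyond this is negligible)
probs = [poisson_pmf(k, lam) for k in range(60)]

total = sum(probs)                               # should be 1
mean = sum(k * p for k, p in enumerate(probs))   # should equal lam
p_zero = probs[0]                                # chance of no events: exp(-lam)

print(total, mean, p_zero)
```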

A discrete random variable \(X\) follows a binomial distribution with parameters \(n\) and \(p\) if:

\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \dots, n \]

We denote this as \(X \sim Bi(n, p)\) where:

  • \(n\) = number of independent trials

  • \(p\) = probability of success in each trial

  • \(k\) = number of successes observed

Survival studies: \(n\) animals, each with survival probability \(p\)

Detection/non-detection: \(n\) surveys, probability \(p\) of detecting species
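
To make the detection/non-detection example concrete, the sketch below evaluates the binomial mass function for a hypothetical study with `n = 5` surveys and detection probability `p = 0.3` (both values are assumptions for illustration). It shows how likely it is that a species that is present is never detected.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bi(n, p), written out from the formula above."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical survey design: 5 surveys, detection probability 0.3 each
n, p = 5, 0.3

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

p_never_detected = pmf[0]                   # equals (1 - p)**n
p_detected_at_least_once = 1 - p_never_detected
mean_detections = sum(k * q for k, q in enumerate(pmf))  # equals n * p

print(p_never_detected, p_detected_at_least_once, mean_detections)
```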

A discrete random variable \(X\) follows a negative binomial distribution with parameters \(r\) and \(p\) if:

\[ P(X = k) = \binom{k + r - 1}{k} (1-p)^r p^k, ~ k = 0, 1, \dots \]

We denote this as \(X \sim \mathrm{NegBi}(r, p)\), the distribution of the number of successes observed before the \(r\)th failure, where:

  • \(r\) = number of failures at which counting stops
  • \(p\) = probability of success on each trial
  • \(k\) = number of successes
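
Reading the mass function above as the probability of \(k\) successes before the \(r\)th failure, a short numerical check confirms that it is a valid distribution and that its mean is \(rp/(1-p)\). The values of `r` and `p` are hypothetical, and the sum is truncated at 400 terms, by which point the tail is negligible for these values.

```python
import math


def negbin_pmf(k, r, p):
    """P(X = k): k successes before the r-th failure, success probability p."""
    return math.comb(k + r - 1, k) * (1 - p) ** r * p**k


# Hypothetical parameters, chosen only for illustration
r, p = 3, 0.4

probs = [negbin_pmf(k, r, p) for k in range(400)]

total = sum(probs)                               # should be 1
mean = sum(k * q for k, q in enumerate(probs))   # should equal r * p / (1 - p)

print(total, mean, r * p / (1 - p))
```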

Example: Bathing Water Quality

  • All bathing water sites in Scotland are classified by SEPA as “Excellent”, “Good”, “Sufficient” or “Poor” in terms of how much faecal bacteria (from sewage) they contain.

  • The minimum standard all beaches or bathing water must meet is “Sufficient”.

  • The sites are classified based on the 90th and 95th percentiles of samples taken over the four most recent bathing seasons.

Figure legend: green is excellent, blue is good, red is sufficient.

  • The classification system assumes that bacterial concentrations at each site follow a log-normal distribution.

  • If this assumption does not hold, the classifications would not be accurate.

  • Therefore, it is crucial that we regularly assess this assumption to ensure the safety of our bathing water.
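
To illustrate how percentiles follow from a log-normal assumption, the sketch below fits a normal distribution to log-transformed counts and back-transforms the 90th and 95th percentiles. The counts are made up for illustration, and this is one plausible calculation rather than a description of SEPA's exact procedure.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical bacterial counts at one site (made-up values, not SEPA data)
counts = [12, 35, 8, 150, 60, 22, 410, 18, 75, 30, 95, 14, 250, 40, 55, 27]

# Fit a normal distribution on the log scale
logs = [math.log(c) for c in counts]
mu_hat = statistics.fmean(logs)
sigma_hat = statistics.stdev(logs)


def lognormal_percentile(q, mu, sigma):
    """q-th quantile of X when ln(X) ~ N(mu, sigma^2)."""
    z = NormalDist().inv_cdf(q)  # standard normal quantile
    return math.exp(mu + z * sigma)


p90 = lognormal_percentile(0.90, mu_hat, sigma_hat)
p95 = lognormal_percentile(0.95, mu_hat, sigma_hat)

print(p90, p95)
```

Because both percentiles depend on the fitted `mu_hat` and `sigma_hat`, a poor log-normal fit would feed directly into the classification.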

  • We can use our standard residual plots to assess log-normality.

  • The top plots show the standard residuals and the bottom plots show the residuals for the log-transformed data.

  • There is no strong evidence to suggest we have breached our assumptions.
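
A rough numerical companion to the residual plots is to compare the skewness of the data before and after taking logs: if the log-normal assumption is reasonable, the log-transformed values should be close to symmetric. The sketch below uses simulated counts with hypothetical parameters, and it complements the diagnostic plots rather than replacing them.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

# Simulated counts standing in for real samples (hypothetical parameters)
counts = [random.lognormvariate(3.5, 1.0) for _ in range(500)]


def skewness(values):
    """Sample skewness: the mean of the cubed standardised values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


skew_raw = skewness(counts)                          # strongly positive
skew_log = skewness([math.log(c) for c in counts])   # close to zero

print(skew_raw, skew_log)
```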

Error in Environmental Measurements

Error in a measurement is the difference between the measured value and the true value.

  • Error may include both random and systematic components.

Random error: Variation observed randomly over repeat measurements.
→ With more measurements, these errors average out (improves accuracy).

Systematic Error

Systematic error: Variation that remains constant over repeated measures.

  • Typically due to some feature of the measurement process.
  • Making more measurements will not improve accuracy (all affected equally).
  • Can only be eliminated by identifying and correcting the cause.

Error Identification Exercise

For each example, identify whether the error is random or systematic:

  1. A meter reads 0.01 even when measuring no sample.
    Hint: Constant offset regardless of measurement…

  2. An old thermometer can only measure to the nearest 0.5 degrees.
    Hint: Precision limitation…

  3. A poorly designed rainfall monitor often leaks water on windy days.
    Hint: Specific condition causing consistent bias…

  4. To estimate the abundance of a fish species in a lake, scientists use a net with a mesh size equal to the average fish length.
    Hint: Only fish above a given size can be caught…

Discuss with a neighbor!

Answers & Discussion

  1. Systematic - Constant offset (bias)
  2. Random - Precision limitation (rounding error varies)
  3. Systematic - Consistent bias under specific conditions
  4. Systematic - Smaller fish pass through the mesh, so abundance is consistently underestimated

Key takeaway: Random errors can be reduced by averaging; systematic errors require calibration, better instruments, or method changes.
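
The takeaway can be illustrated with a small simulation. The sketch below assumes a hypothetical true value, a constant instrument offset and normally distributed random noise; averaging many readings removes most of the random error but leaves the offset untouched.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical measurement set-up (all values are assumptions)
true_value = 20.0  # the quantity being measured
offset = 0.5       # constant systematic error, e.g. a miscalibrated meter
noise_sd = 1.0     # standard deviation of the random error


def mean_error(n, bias):
    """Error of the average of n simulated readings with a given bias."""
    readings = [true_value + bias + random.gauss(0, noise_sd) for _ in range(n)]
    return statistics.fmean(readings) - true_value


random_only_small = mean_error(10, 0.0)       # few readings: noticeable error
random_only_large = mean_error(100000, 0.0)   # many readings: error near 0
biased_large = mean_error(100000, offset)     # many readings: error near offset

print(random_only_small, random_only_large, biased_large)
```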